Find a duplicate, Space Edition™.

We have a list of integers, where:

  1. The integers are in the range 1..n
  2. The list has a length of n+1

It follows that our list has at least one integer which appears at least twice. But it may have several duplicates, and each duplicate may appear more than twice.

Write a function which finds an integer that appears more than once in our list. Don't modify the input! (If there are multiple duplicates, you only need to find one of them.)

We're going to run this function on our new, super-hip MacBook Pro With Retina Display™. Thing is, the damn thing came with the RAM soldered right to the motherboard, so we can't upgrade our RAM. So we need to optimize for space!

Gotchas

We can do this in O(1) space.

We can do this in less than O(n^2) time while keeping O(1) space.

We can do this in O(n lg n) time and O(1) space.

We can do this without modifying the input.

Most O(n lg n) algorithms double something or cut something in half. How can we rule out half of the numbers each time we iterate through the list?

Breakdown

This one's a classic! We just do one walk through the list, using a set to keep track of which items we've seen!

  def find_repeat(numbers):
    numbers_seen = set()
    for number in numbers:
        if number in numbers_seen:
            return number
        else:
            numbers_seen.add(number)

    # Whoops--no duplicate
    raise Exception('no duplicate!')

Bam. O(n) time and ... O(n) space ...

Right, we're supposed to optimize for space. O(n) is actually kinda high space-wise. Hm. We can probably get O(1)...

We can "brute force" this by taking each number in the range 1..n1..n and, for each, walking through the list to see if it appears twice.

  def find_repeat_brute_force(numbers):
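    # Our list has n+1 items, so len(numbers) - 1 is n
    # and range(1, len(numbers)) covers the candidate values 1..n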
    for needle in range(1, len(numbers)):
        has_been_seen = False
        for number in numbers:
            if number == needle:
                if has_been_seen:
                    return number
                else:
                    has_been_seen = True

    # Whoops--no duplicate
    raise Exception('no duplicate!')

This is O(1) space and O(n^2) time.

That space complexity can't be beat, but the time cost seems a bit high. Can we do better?

One way to beat O(n^2) time is to get O(n lg n) time. Sorting takes O(n lg n) time. And if we sorted the list, any duplicates would be right next to each other!

But if we start off by sorting our list we'll need to take O(n) space to store the sorted list...

...unless we sort the input list in place!

An in-place function modifies data structures or objects outside of its own stack frame (i.e., data stored on the process heap or in the stack frame of a calling function). Because of this, the changes made by the function remain after the call completes.

But wait: what's a stack frame? Let's take a quick detour through the call stack, then come back to in-place functions.

Overview

The call stack is what a program uses to keep track of function calls. The call stack is made up of stack frames—one for each function call.

For instance, say we called a function that rolled two dice and printed the sum.

  import random

def roll_die():
    return random.randint(1, 6)

def roll_two_and_sum():
    total = 0
    total += roll_die()
    total += roll_die()
    print(total)

roll_two_and_sum()

First, our program calls roll_two_and_sum(). It goes on the call stack:

roll_two_and_sum()

That function calls roll_die(), which gets pushed on to the top of the call stack:

roll_die()
roll_two_and_sum()

Inside of roll_die(), we call random.randint(). Here's what our call stack looks like then:

random.randint()
roll_die()
roll_two_and_sum()

When random.randint() finishes, we return back to roll_die() by removing ("popping") random.randint()'s stack frame.

roll_die()
roll_two_and_sum()

Same thing when roll_die() returns:

roll_two_and_sum()

We're not done yet! roll_two_and_sum() calls roll_die() again:

roll_die()
roll_two_and_sum()

Which calls random.randint() again:

random.randint()
roll_die()
roll_two_and_sum()

random.randint() returns, then roll_die() returns, putting us back in roll_two_and_sum():

roll_two_and_sum()

Which calls print():

print()
roll_two_and_sum()
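
If you're curious, you can watch this happen from inside Python: the standard inspect module exposes the live call stack. A tiny sketch (the function names inner and outer are just for illustration):

  import inspect

def inner():
    # Name every function currently on the call stack,
    # from the top (this function) down to the module level
    print([frame.function for frame in inspect.stack()])

def outer():
    inner()

outer()
# Prints something like: ['inner', 'outer', '<module>']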

What's stored in a stack frame?

What actually goes in a function's stack frame?

A stack frame usually stores:

  • Local variables
  • Arguments passed into the function
  • Information about the caller's stack frame
  • The return address—what the program should do after the function returns (i.e.: where it should "return to"). This is usually somewhere in the middle of the caller's code.

Some of the specifics vary between processor architectures. For instance, AMD64 (64-bit x86) processors pass some arguments in registers and some on the call stack. And, ARM processors (common in phones) store the return address in a special register instead of putting it on the call stack.

The Space Cost of Stack Frames

Each function call creates its own stack frame, taking up space on the call stack. That's important because it can impact the space complexity of an algorithm. Especially when we use recursion.

For example, if we wanted to multiply all the numbers between 1 and n, we could use this recursive approach:

  def product_1_to_n(n):
    return 1 if n <= 1 else n * product_1_to_n(n - 1)

What would the call stack look like when n = 10?

First, product_1_to_n() gets called with n = 10:

    product_1_to_n()    n = 10

This calls product_1_to_n() with n = 9.

    product_1_to_n()    n = 9
    product_1_to_n()    n = 10

Which calls product_1_to_n() with n = 8.

    product_1_to_n()    n = 8
    product_1_to_n()    n = 9
    product_1_to_n()    n = 10

And so on until we get to n = 1.

    product_1_to_n()    n = 1
    product_1_to_n()    n = 2
    product_1_to_n()    n = 3
    product_1_to_n()    n = 4
    product_1_to_n()    n = 5
    product_1_to_n()    n = 6
    product_1_to_n()    n = 7
    product_1_to_n()    n = 8
    product_1_to_n()    n = 9
    product_1_to_n()    n = 10

Look at the size of all those stack frames! The entire call stack takes up O(n) space. That's right: we have an O(n) space cost even though our function itself doesn't create any data structures!

What if we'd used an iterative approach instead of a recursive one?

  def product_1_to_n(n):
    # We assume n >= 1
    result = 1
    for num in range(1, n + 1):
        result *= num

    return result

This version takes a constant amount of space. At the beginning of the loop, the call stack looks like this:

    product_1_to_n()    n = 10, result = 1, num = 1

As we iterate through the loop, the local variables change, but we stay in the same stack frame because we don't call any other functions.

    product_1_to_n()    n = 10, result = 2, num = 2

    product_1_to_n()    n = 10, result = 6, num = 3

    product_1_to_n()    n = 10, result = 24, num = 4

In general, even though the compiler or interpreter will take care of managing the call stack for you, it's important to consider the depth of the call stack when analyzing the space complexity of an algorithm.

Be especially careful with recursive functions! They can end up building huge call stacks.

What happens if we run out of space? It's a stack overflow! In Python 3.6, you'll get a RecursionError.
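
Want to see the limit? Python exposes it through the sys module (a quick sketch; the default is usually 1000, but it varies by environment):

  import sys

print(sys.getrecursionlimit())  # Usually 1000

def product_1_to_n(n):
    return 1 if n <= 1 else n * product_1_to_n(n - 1)

# Far more than 1000 nested calls, so this raises RecursionError
product_1_to_n(100000)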

If the very last thing a function does is call another function, then its stack frame might not be needed any more. The function could free up its stack frame before doing its final call, saving space.

This is called tail call optimization (TCO). If a recursive function is optimized with TCO, then it may not end up with a big call stack.

Most languages don't provide TCO. Scheme is one of the few languages that guarantee tail call optimization. Some Ruby, C, and JavaScript implementations may do it. Python and Java decidedly don't.
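
To make the TCO idea concrete, here's a sketch of product_1_to_n() rewritten in tail-call style, threading the running product through an accumulator argument (our own variation, not code from earlier). In Scheme this would run in O(1) stack space; in Python it still builds an O(n) call stack, since Python doesn't do TCO:

  def product_1_to_n(n, accumulator=1):
    # The recursive call is the very last thing this function does,
    # so a language with TCO could reuse the current stack frame
    if n <= 1:
        return accumulator
    return product_1_to_n(n - 1, accumulator * n)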

Okay, back to in-place functions.

In-place algorithms are sometimes called destructive, since the original input is "destroyed" (or modified) during the function call.

Careful: "In-place" does not mean "without creating any additional variables!" Rather, it means "without creating a new copy of the input." In general, an in-place function will only create additional variables that are O(1)O(1) space.

An out-of-place function doesn't make any changes that are visible to other functions. Usually, those functions copy any data structures or objects before manipulating and changing them.

In many languages, primitive values (integers, floating point numbers, or characters) are copied when passed as arguments, and more complex data structures (lists, heaps, or hash tables) are passed by reference. Python behaves roughly this way: every argument is a reference to an object, but since primitives like integers are immutable, they act as if they were copied.
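
Here's a quick illustration of that difference (the function names are made up for this example):

  def add_one(number):
    # Rebinds the local name only; the caller's integer is unchanged
    number += 1

def append_four(numbers):
    # Mutates the very list the caller passed in
    numbers.append(4)

x = 5
add_one(x)
print(x)  # Still 5

my_list = [1, 2, 3]
append_four(my_list)
print(my_list)  # [1, 2, 3, 4]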

Here are two functions that do the same operation on a list, except one is in-place and the other is out-of-place:

  def square_list_in_place(int_list):
    for index, element in enumerate(int_list):
        int_list[index] *= element

    # NOTE: no need to return anything - we modified
    # int_list in place


def square_list_out_of_place(int_list):
    # We allocate a new list with the length of the input list
    squared_list = [None] * len(int_list)

    for index, element in enumerate(int_list):
        squared_list[index] = element ** 2

    return squared_list

Working in-place is a good way to save time and space. An in-place algorithm avoids the cost of initializing or copying data structures, and it usually has an O(1) space cost.

But be careful: an in-place algorithm can cause side effects. Your input is "destroyed" or "altered," which can affect code outside of your function. For example:

  original_list = [2, 3, 4, 5]
square_list_in_place(original_list)

print("original list: %s" % original_list)
# Prints: original list: [4, 9, 16, 25], confusingly!

Generally, out-of-place algorithms are considered safer because they avoid side effects. You should only use an in-place algorithm if you're space constrained or you're positive you don't need the original input anymore, even for debugging.

Okay, so this'll work:

  1. Do an in-place sort of the list (for example an in-place merge sort).
  2. Walk through the now-sorted list from left to right.
  3. Return as soon as we find two adjacent numbers which are the same.

This'll keep us at O(1) space and bring us down to O(n lg n) time.
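
Here's a sketch of that plan. One caveat: Python's built-in list.sort() is in-place from the caller's point of view, but the underlying Timsort can allocate up to O(n) of scratch space, so a strictly O(1)-space version would want something like an in-place heapsort:

  def find_repeat_sort_in_place(numbers):
    # Destructive: this reorders the caller's list!
    numbers.sort()

    # Any duplicates are now adjacent
    for index in range(1, len(numbers)):
        if numbers[index] == numbers[index - 1]:
            return numbers[index]

    # Whoops--no duplicate
    raise Exception('no duplicate!')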

But modifying the input is kind of a drag—it might cause problems elsewhere in our code. Can we maintain this time and space cost without modifying the input?

Let's take a step back. How can we break this problem down into subproblems?

If we're going to do O(n lg n) time, we'll probably be iteratively doubling something or iteratively cutting something in half. That's how we usually get a "lg n". So what if we could cut the problem in half somehow?

Well, binary search works by cutting the problem in half after figuring out which half of our input list holds the answer. A quick refresher:

A binary search algorithm finds an item in a sorted list in O(lg n) time.

A brute force search would walk through the whole list, taking O(n) time in the worst case.

Let's say we have a sorted list of numbers. To find a number with a binary search, we:

  1. Start with the middle number: is it bigger or smaller than our target number? Since the list is sorted, this tells us if the target would be in the left half or the right half of our list.
  2. We've effectively divided the problem in half. We can "rule out" the whole half of the list that we know doesn't contain the target number.
  3. Repeat the same approach (of starting in the middle) on the new half-size problem. Then do it again and again, until we either find the number or "rule out" the whole set.

We can do this recursively, or iteratively. Here's an iterative version:

  def binary_search(target, nums):
    """See if target appears in nums"""

    # We think of floor_index and ceiling_index as "walls" around
    # the possible positions of our target, so by -1 below we mean
    # to start our wall "to the left" of the 0th index
    # (we *don't* mean "the last index")
    floor_index = -1
    ceiling_index = len(nums)

    # If there isn't at least 1 index between floor and ceiling,
    # we've run out of guesses and the number must not be present
    while floor_index + 1 < ceiling_index:

        # Find the index ~halfway between the floor and ceiling
        # We use integer division, so we'll never get a "half index"
        distance = ceiling_index - floor_index
        half_distance = distance // 2
        guess_index = floor_index + half_distance

        guess_value = nums[guess_index]

        if guess_value == target:
            return True

        if guess_value > target:
            # Target is to the left, so move ceiling to the left
            ceiling_index = guess_index
        else:
            # Target is to the right, so move floor to the right
            floor_index = guess_index

    return False
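
A couple of quick sanity checks on that function:

  binary_search(3, [1, 3, 5, 7])  # -> True
binary_search(4, [1, 3, 5, 7])  # -> False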

How did we know the time cost of binary search was O(lg n)? The only non-constant part of our time cost is the number of times our while loop runs. Each step of our while loop cuts the range (dictated by floor_index and ceiling_index) in half, until our range has just one element left.

So the question is, "how many times must we divide our original list size (n) in half until we get down to 1?"

n · 1/2 · 1/2 · 1/2 · 1/2 · ... = 1

How many 1/2's are there? We don't know yet, but we can call that number x:

n · (1/2)^x = 1

Now we solve for x:

n · (1^x / 2^x) = 1

n · (1 / 2^x) = 1

n / 2^x = 1

n = 2^x

Now to get the x out of the exponent. How do we do that? Logarithms.

Recall that log₁₀(100) means, "what power must we raise 10 to, to get 100?" The answer is 2.

So in this case, if we take the log₂ of both sides...

log₂(n) = log₂(2^x)

The right hand side asks, "what power must we raise 2 to, to get 2^x?" Well, that's just x.

log₂(n) = x

So there it is. The number of times we must divide n in half to get down to 1 is log₂(n). So our total time cost is O(lg n).
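
We can sanity-check that with a few lines of throwaway code:

  def times_we_can_halve(n):
    count = 0
    while n > 1:
        n //= 2
        count += 1
    return count

print(times_we_can_halve(16))    # 4, and log₂(16) = 4
print(times_we_can_halve(1024))  # 10, and log₂(1024) = 10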

Careful: we can only use binary search if the input list is already sorted.


But in a binary search, the reason we can confidently say which half has the answer is because the list is sorted. For this problem, when we cut our unsorted list in half we can't really make any strong statements about which elements are in the left half and which are in the right half.

What if we could cut the problem in half a different way, other than cutting the list in half?

With this problem, we're looking for a needle (a repeated number) in a haystack (list). What if instead of cutting the haystack in half, we cut the set of possibilities for the needle in half?

The full range of possibilities for our needle is 1..n. How could we test whether the actual needle is in the first half of that range (1..n/2) or the second half (n/2 + 1..n)?

A quick note about how we're defining our ranges: when we take n/2 we're doing integer division, so we throw away the remainder. To see what's going on, we should look at what happens when n is even and when n is odd:

  • If n is 6 (an even number), we have n/2 = 3 and n/2 + 1 = 4, so our ranges are 1..3 and 4..6.
  • If n is 5 (an odd number), n/2 = 2 (we throw out the remainder) and n/2 + 1 = 3, so our ranges are 1..2 and 3..5.

So we can notice a few properties about our ranges:

  1. They aren't necessarily the same size.
  2. They don't overlap.
  3. Taken together, they represent the original input list's range of 1..n. In math terminology, we could say their union is 1..n.

So, how do we know if the needle is in 1..n/2 or n/2 + 1..n?

Think about the original problem statement. We know that we have at least one repeat because there are n+1 items and they are all in the range 1..n, which contains only n distinct integers.

This notion of "we have more items than we have possibilities, so we must have at least one repeat" is pretty powerful. It's sometimes called the pigeonhole principle.

The pigeonhole principle states that if n items are put into m containers, with n > m, then at least one container must contain more than one item.

For example, there must be at least two left gloves or two right gloves in a group of three gloves.

Can we exploit the pigeonhole principle to see which half of our range contains a repeat?

Imagine that we separated the input list into two sublists—one containing the items in the range 1..n/2 and the other containing the items in the range n/2 + 1..n.

Each sublist has a number of elements as well as a number of possible distinct integers (that is, the length of the range of possible integers it holds).

Given what we know about the number of elements vs the number of possible distinct integers in the original input list, what can we say about the number of elements vs the number of distinct possible integers in these sublists?

The sum of the sublists' numbers of elements is n+1 (the number of elements in the original input list), and the sum of the sublists' numbers of possible distinct integers is n (the number of possible distinct integers in the original input list).

Since the sum of the sublists' numbers of elements is 1 greater than the sum of their numbers of possible distinct integers, one of the sublists must have at least one more element than it has possible distinct integers.

Not convinced? We can prove this by contradiction. Suppose neither list had more elements than it had possible distinct integers. In other words, both lists have at most the same number of items as they have distinct possibilities. The sum of their numbers of items would then be at most the total number of possibilities across each of them, which is nn. This is a contradiction—we know that our total number of items from the original input list is n+1n+1, which is greater than nn.
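
In symbols: say the sublists have e1 and e2 elements, with p1 and p2 possible distinct integers. If e1 ≤ p1 and e2 ≤ p2, then e1 + e2 ≤ p1 + p2 = n. But we know e1 + e2 = n + 1 > n, so at least one of e1 > p1 or e2 > p2 must hold.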

Now that we know one of our sublists has at least one more item than it has distinct possibilities, we know that sublist must contain at least one duplicate, by the same pigeonhole argument we used to see that the original input list has at least one duplicate.

So once we know which sublist has more items than distinct possibilities, we can use this same approach recursively, cutting that sublist's range into two halves, and so on, until we have just one number left in our range.

Of course, we don't need to actually separate our list into sublists. All we care about is how long each sublist would be. So we can simply do one walk through the input list, counting the number of items that would be in each sublist.

Can you formalize this in code?

Careful—if we do this recursively, we'll incur a space cost in the call stack! Do it iteratively instead.

Solution

Our approach is similar to a binary search, except we divide the range of possible answers in half at each step, rather than dividing the list in half.

  1. Find the number of integers in our input list which lie within the range 1..n/2.
  2. Compare that to the number of possible unique integers in the same range.
  3. If the number of actual integers is greater than the number of possible integers, we know there's a duplicate in the range 1..n/2, so we iteratively use the same approach on that range.
  4. If the number of actual integers is not greater than the number of possible integers, we know there must be a duplicate in the range n/2 + 1..n, so we iteratively use the same approach on that range.
  5. At some point, our range will contain just 1 integer, which will be our answer.
  def find_repeat(numbers):
    floor = 1
    ceiling = len(numbers) - 1

    while floor < ceiling:
        # Divide our range 1..n into an upper range and lower range
        # (such that they don't overlap)
        # Lower range is floor..midpoint
        # Upper range is midpoint+1..ceiling
        midpoint = floor + ((ceiling - floor) // 2)
        lower_range_floor, lower_range_ceiling = floor, midpoint
        upper_range_floor, upper_range_ceiling = midpoint+1, ceiling

        # Count number of items in lower range
        items_in_lower_range = 0
        for item in numbers:
            # Is it in the lower range?
            if item >= lower_range_floor and item <= lower_range_ceiling:
                items_in_lower_range += 1

        distinct_possible_integers_in_lower_range = (
            lower_range_ceiling
            - lower_range_floor
            + 1
        )
        if items_in_lower_range > distinct_possible_integers_in_lower_range:
            # There must be a duplicate in the lower range
            # so use the same approach iteratively on that range
            floor, ceiling = lower_range_floor, lower_range_ceiling
        else:
            # There must be a duplicate in the upper range
            # so use the same approach iteratively on that range
            floor, ceiling = upper_range_floor, upper_range_ceiling

    # Floor and ceiling have converged
    # We found a number that repeats!
    return floor
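
A couple of quick examples, with inputs shaped the way the problem describes (values in 1..n, length n+1):

  find_repeat([1, 2, 3, 2])                 # -> 2
find_repeat([4, 1, 4, 8, 3, 2, 7, 6, 5])  # -> 4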

Complexity

O(1) space and O(n lg n) time.

Tricky as this solution is, we can actually do even better, getting our runtime down to O(n) while keeping our space cost at O(1). The solution is NUTS; it's probably outside the scope of what most interviewers would expect.

Bonus

This function always returns one duplicate, but there may be several duplicates. Write a function that returns all duplicates.

What We Learned

Our answer was a modified binary search. We got there by reasoning about the expected runtime:

  1. We started with an O(n^2) "brute force" solution and wondered if we could do better.
  2. We knew that to beat O(n^2), we'd probably do O(n) or O(n lg n), so we started thinking of ways we might get an O(n lg n) runtime.
  3. lg n usually comes from iteratively cutting stuff in half, so we arrived at the final algorithm by exploring that idea.

Starting with a target runtime and working backward from there can be a powerful strategy for all kinds of coding interview questions.
